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1. INTRODUCTORY DESCRIPTION 


1.1 ASSUMPTIONS AND PROBLEM DEFINITION 

The fundamental mathematical assumption underlying CLASSY is that the data 
may be usefully approximated by a mixture of multivariate normal densities. 
That is, if p is probability and x is an observation vector, 

m 

p(x|m,n) * 2 a.p.(x|y.,I,) (1) 

i=l 1 1 11 

where 

a.j = the a priori probability of occurrence of class i 

P-jUlu.,^) = the multivariate normal probability density function for 
class i 

m ■ the total number of classes 

•n = the full set of parameters 

= v’-v z i**"*y 

Given a set of unlabeled sample vectors {x^.}, we may form the likelihood func- 
tion in the following manner. 


N r 


L({Xj}|m 


,.i. TT 
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3=1 




where N = the total number of samples. 


( 2 ) 


So far, the assumptions and equations parallel the usual maxi mum- likelihood 
development. CLASSY makes the additional assumption that each value of the 
parameters m and tt occurs with an a priori probability A(m,Tr). The objective 
of CLASSY is to determine the discrete parameter m and the continuous param- 
eter vector 77 to maximize the following function. 
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L({Xj},m,7r) - A(m,Tr) 


(Xj | ) 


(3) 


n r 

tt 

j=i 


m 

S 

i=l 


A(m,7r ) must be chosen so that It satisfies the normalization constraint 
m 


f A(m,Tr 

m=l J 


)dir = 1 


(4) 


Typically, in the absence of other information, the a priori probabilities 
may be chosen as 


m 

A ( m , it ) = / /T C. 

i=l 1 


(5) 


where C i = a constant containing normalization factors over tt. With this 
choice for A(m,7r), the function to be maximized becomes 



where d = dimensionality of the samples. 


( 6 ) 


1.2 SOLUTION PROCEDURE 

Many approaches may be taken in maximizing eq. (4). The approach chosen in 
CLASSY is to interleave the maximum- likelihood iteration (designed to maxi- 
mize L((xj } ,m,ir) with respect to the continuous parameter vector tt) with a 
discrete split, join, and combine process (designed to maximize L({Xj},m,Tr) 
with respect to the discrete parameter m). It is expected that by alternat- 
ing these two techniques, values of m and tt corresponding to at least a local 
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maximum of L ( { > ,m,7r) will be determined. Since the splitting and combining 
techniques operate around each existing cluster, only in special cases will 
this local maximum fail to be global. 

The data are first scrambled to ensure that a true random sample is obtained. 
This is especially important in the CLASSY algorithm since any correlation 
in the data may cause the maximum-likelihood procedure to converge very 
slowly or experience cyclic drifts. The initial values assumed are 

m = 1 \ 

a i ■ 1 


The data are then examined point by point, and the parameter vector tt is up- 
dated according to the maximum likelihood equations which may be expressed 
as follows (ref. 1 ). 


p(i Ixk.tt) = 


a i p j( x kl u i’ £ l) 

m 

E a iPi (* k IV z i> 


= M P(1 


‘1 N 
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( 10 ) 



£p(i|x k ,tr) 
k=l K 


where p(i|x^tir) is the posterior probability of class i, given the k th sample 
vector and value of the parameters, and the primes refer to new or updated 
values for the parameters. 

This same technique is applied to the accumulation of the third- and fourth- 
order moments and the logarithm likelihood for each cluster. These statistics 
are used to test the fit of the hypothesized distribution to the data. 

As each point is considered, the probability that it belongs to each class is 
computed. These probabilities, which may be thought of as the fractional part 
of each data point assigned to each cluster, are accumulated as the "weights" 
for each cluster (eq. (8)). When the weight for a given cluster exceeds a 
threshold value (which increases each time it is exceeded), the program checks 
the likelihood ratio and the fit of the normal distribution to the data for 
that cluster. Old data (accumulated usirg less accurate parameter values) is 
also subtracted from each parameter sum ^cumulation at this time. 

The fit of the hypothesized normal distribution to the data for a cluster is 
evaluated by examining the third- and fourth-order moments about the mean 
for that cluster, which represent measures of skewness and kurtosis. The 
statistics which are generated are given by 

S, ■ (sV’s) (12) 
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where 

S = the skewness vector (trace of the rank 3 skewness tensor using the 
inverse covariance) 

S.j = a scalar measure of skewness 

S T = transpose of S 

>”"* = the inverse covariance matrix 

K 1 = Tr(KE _1 ) (13) 

where 

K * matrix of kurtosis values (trace of the rank 4 kurtosis tensor) 

K-|,K 2 ,*** = scalar measures of kurtosis 

K 2 = Tr(KS" 1 KZ" 1 ) - l [TrO^K )] 2 (14) 

In CLASSY, these three statistics are tested against their approximate sam- 
pling distributions computed under the hypothesis that the samples were 
drawn from the normal distribution specified by the current values of the 
parameters. If any one of these three statistics exceeds the threshold 
value, the hypothesis is generated that the cluster may be split into two 
parts. The parameters for each of the two new component clusters are esti- 
mated by minimizing the difference between the observed covariance matrix, 
the skewness vector, and the kurtosis matrix, and the corresponding quantities 
for the mixture distribution composed of the two new normal distributions. 

Following the generation of a split hypothesis, the parent cluster is not 
discarded immediately. When the maximum-likelihood iteration cycle is begun 
again, it is carried out for the previously existing clusters, including the 
parent cluster and the new subclusters (with the new parameters and an ini- 
tial weight, which is currently set to 40 points for each cluster). Thus, a 
hierarchial structure or cluster tree evolves as this process is repeated. 
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At the same time In the processing that a cluster Is checked to see If It 
may need to be split, certain other tests are performed. If a cluster has 
subclusters (l.e., has been previously split), It Is not split again; but 
the likelihood ratio of the daughter clusters to the parent cluster Is 
examined. If this ratio Is larger than a given threshold, the parent clus- 
ter Is eliminated and the daughter clusters take Its place (l.e., the hypothe 
sis that the parent Is split Is accepted). On the other hand. If the 
ratio Is too small, the daughter clusters are eliminated In favor of the 
parent (l.e., the hypothesis that the parent is split Is rejected). In addi- 
tion, a cluster may be eliminated If Its prior probability becomes too small, 
which may occur If another cluster has "taken" all Its points. The program 
also checks the degree of overlap between clusters at the same level in the 
cluster tree. If the degree of overlap is too great and the two clusters 
are not the only subclusters of a given parent cluster, the hypothesis that 
these are the same cluster is raised by joining the two similar clusters. 

The new cluster is given parameters which are a combination of the clusters 
joined to form It. All of these are tests restructuring the cluster tree at 
certain Intervals; namely, when the weight (or number of points assigned to a 
given cluster on a fractional probabilistic basis) has accumulated to a cer- 
tain point in the statistics accumulation portion of the program. 

After tests have been made to determine If a cluster may need to be split or 
If the cluster tree may need to be restructured , the old data are subtracted 
from the cluster statistics previously accumulated, and the skewness vector 
and the kurtosls matrix for that cluster are reset to zero. The program then 
continues the statistical accumulation. If a complete pass through the data 
set is made before a cluster Is tested for possible adjustment, the values 
of the means at that time are used In eq. (11) until another pass through the 
data set has been completed; that Is, the program does full iteration for the 
cluster rather than continuous updating. 

The present program cycles through the data a fixed number of time, as con- 
trolled by an external parameter. When the desired number of passes is com- 
plete, the program classifies the data by going through them point by point 
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and assigning each data point to the duster in the cluster tree for which the 
probability of occurrence of this data point is the greatest. This is the 
only time in the program that points are assigned to clusters. When all of 
the points have been assigned, a cluster map showing the cluster symbol for 
each point is printed out. The program also prints out the final values for 
the parameters for each cluster in the cluster tree. A general flow diagram 
for the CLASSY program is shown in figure 1. 


1-7 






Figure 1.- Flow diagram for CLASSY algorithm. 
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2. MATHEMATICAL BACKGROUND 


Descriptions of the detailed mathematics and statistics used in CLASSY are 
given in this section, noting variables which have a direct correspondence 
to theoretical quantities. In addition, preliminary remarks establishing 
equations used in more than one routine are included. 


2.1 FUNDAMENTAL EQUATIONS FOR CONTINUOUS STATISTICS 

The fundamental equations for the continuous statistical parameters follow: 


/V 




(15) 


M. = 


N 

'L 

' 


p<t 


x 0> 


(16) 


N 

x i x J T P(i|x J ) * u i y i T 

where 

, a.p.(x.) 

£»n p n (x J> 

where 

N = total number of points 

m = number of classes 

a. = proportion 

p. = mean 

Z^ = covariance 

and values marked with a circumflex (~ ) denote estimates. 


(17; 


(18) 
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These equations are the standard maxi mum -like 11 hood equations for a mixture 
of normql distributions with unknown parameters and proportions. (CLASSY 
actually solves the equations with the large-sample size approximation by substl 
tutlng u for u in eq. (17).) 

There are obviously many numerical procedures for solving these equations, 
given a specific set of data points Xj. CLASSY uses direct functional itera- 
tions for eqs. (16) and (17) (substitution of the right-hand side Into the 
left side), and a somewhat more complicated scheme for the proportion equa- 
tion, eq. (15). Each iteration is subject to Aitken extrapolation, by fixed 
parameters contained in arrays PACCEL, VACCEL, and MACCEL which are currently 
all zero, corresponding to unaccelerated iteration. 

2.2 MODE OF ITERATION 

CLASSY does not use complete iterations on these equations at all times; when 
the estimates for a cluster are considered poor, statistics are calculated in a 
continuous or updated fashion. Mode selection is on a cluster-by-cluster 
basis. The absolute value of the variable INDEX(KL) is used to store the 
number or "name'* of the cluster. The updating mode affects only the cluster 
parameters a, u, and £, and does not change the processing of skewness, kur- 
tosis, or likelihood ratio. 

2.2.1 UPDATING MODE: (INDEX(KL) > 0) 

Every newly created cluster is in the updating mode, the normal mode for a 
cluster whose statistics are not well defined. Alsc, a cluster processed 
through the routine ADJUST because Its weight (W) exceeds its weight adjust- 
ment threshold (WADJ) is placed In the updating mode on finishing ADJUST. 

Statistics are calculated on a current, or running, basis for clusters in 
this mode. The values of a, p, and l appearing on the left of eqs. (15)- (17) 
change from each point processed to the next. That Is, the sums on the left 
gain one additional point for each point processed, and this most recent 
value of the parameters Is used for the next data point. Cluster parameters 
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are continuously varying, rather than corresponding to some particular Itera- 
tion of the numerical scheme. Although this procedure may seem somewhat 
irregular, it follows the general dictum of numerical analysis, "use the most 
recent data." In fact, the values for a, p, and E produced in this way should 
be better estimates of their true values, and thus lead to faster convergence 
of the numerical procedure. This technique, when combined with the short 
iteration procedure in which only a fraction of the points are processed 
between passes through ADJUST, allows the algorithm to find good coarse esti- 
mates of the cluster parameters relatively quickly, often within the first 
pass through the data rather than after several iterations. However, to be 
effective, the updating iteration mode requires that the data sample have no 
large-scale variation; i.e., that real data be scrambled or disordered to 
destroy point-to-point and regional correlation. 

In the update mode, the cluster vector SUM is used to store Wp (the current 
mean), and OSUM (Old SUM) is used to save the value of SUM when the cluster 
was last processed (created or last ADJUSTed). Similarly, the cluster array 
OVAR (Old VARiance) contains WE* (the covariance at last processing), and 
VRIN (VaRiance INverse) contains the inverse of WE (the current covariance 
matrix). 

2.2.2 ITERATING MODE: (INDEX(KL) < 0) 

A cluster is placed in iterating mode by ADJUST if it is processed by that 
routine due to a complete pass having been made through the data since its 
creation or last ADJUSTment (IDADJ(KL) <_ NPTSO). A cluster in this mode 
observes the complete data sample between processing passes through ADJUST, 
and thus its numerical procedure can be considered to operate iteratively. 
(There is no fixed relation between the start of the iterative cycles of dif- 
ferent clusters - one may start at point 5, a second at point 3000, and a 
third at point 12345. However, all the iterations for a given cluster start 
at the same pixel, unless (very rarely) the cluster reenters update mode.) 

When a cluster is being processed in the iterating mode, the parameters a, 
p, and E appearing on the left side of eqs. (15)- (17) are fixed at the 
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cluster’s last pass through ADJUST and remain fixed until Its next pass. 

Thus, the sums are calculated for an entire data sample using the same values 
of the parameters. This Is necessary to actually solve the equations, since 
processing In the update mode for full passes through the data would lead to 
cyclic drifts in the parameters, which in turn would alter the value of the 
sums and the actual estimates. Although this should not be a large effect 
for scrambled or disordered data, the Iterating mode is necessary to keep the 
program within the design objective of well-defined mathematics. 

In the iterate mode, the cluster vectors SUM and OSUM have the same meaning 
as in the update mode, but OSUM is always used to calculate the mean used in 
the probability calculation. For the arrays OVAR and VRIN, the situation Is 

a 

different; VRIN contains the inverse of W£' (the old covariance) and OVAR 
contains WE (the current covariance). This is necessary because VRIN is used 
in numerous places as the inverse of the covariance matrix, both in the equa- 
tions and as the metric for the space of data vectors. 

2.3 TREE STRUCTURE 

In addition to the continuous variables, CLASSY maintains a discrete tree 
structure of clusters and several continuous statistics used in updating this 
structure. This discrete structure differentiates CLASSY from a simple 
maxi mum- likelihood approximator using accelerated numerical techniques. 

Under each cluster there may be two or more subclusters, each with a subclus- 
ter structure of its own. The sum of the distributions of the subclusters of 
a cluster is an alternative distribution to that represented by the cluster 
itself. As CLASSY is presently constituted, only the case of one cluster 
versus many can be represented; many versus many Is not allowed. (One versus 
one is effectively handled by the continuous statistics section.) 

Between each cycle through ADJUST, STATIS accumulates a likelihood ratio 
between the parent cluster and its subclusters. The natural logarithm of 
this ratio is maintained in the cluster variable SPFAC (Splitting FACtor), 
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where SPFAC much less than 0 favors the parent cluster and SPFAC much greater 
than 0 favors the subclusters. 

This likelihood ratio is used in ADJUST to make a final decision in favor of 
the parent cluster or its subclusters; the losing alternative is deleted by 
SEPER or by SUBLIM. SPFAC is also used in STATIS to choose the parent clus- 
ter probability or the sum of the subcluster resultant probabilities. The 
selection is done by a continuous changeover via ( p parent + ZP S ubs^^ + z ^’ 
where Z = exp (SPFAC) is the likelihood ratio. 

SPFAC is initialized when a cluster is created and each time it is ADJUSTed, 
and is set to the value returned by the subroutine APRIOR (currently a con- 
stant). This is the point of entry in the program for the necessary bias 
factor against large numbers of clusters, and for various volume normaliza- 
tion factors discussed in more detail under the APRIOR routine. The cluster 
variable OPRIOR (Old PRIOR) is also set to this initial value of SPFAC to 
retain the initial value for reference. 

Another statistic accumulated by STATIS is the square of the normalized 
probability difference between a parent cluster and the sum of its subclusters, 
PQRAT. 

2 

PQRAT = 09) 

\ sub parent/ 

This is a crude measure of how much a parent cluster differs from the mixture 
of its subclusters. If the subcluster total becomes nearly the same as the 
parent cluster density, and the likelihood ratio does not favor the subclus- 
ters, then ADJUST assumes that the parent is the "real" cluster, and elimi- 
nates the subclusters via SUBLIM. This is a common situation, since if the 
subclusters do not fit the data better (e.g., if the data are best fit by a 
single normal) then the subcluster parameters will change until the sum fits 
the single normal. Thus, subclusters are generally eliminated by becoming 
very similar to the parent cluster, rather than directly by an unfavorable 
likelihood ratio. 
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The other statistics maintained for each cluster are the traces of the skew 
ness and kurtosis tensors (not divided by the total weight). 


SKEWj 

■Ev ! 

(20) 

KURT 1j 

= £ W 2 

(21) 


where 

x. = x i - y. ; x 2 = x i x J (r 1 ) iJ = 7 t Z‘ 1 x 

These are accumulated between ADJUSTments (or creation) of a cluster. (They 
are zeroed each time through ADJUST.) These statistics are summarized to 
make three scalar statistics in ADJUST and are then tested against thresholds 
calculated in CLINIT, to decide whether to consider splitting a cluster. The 
three summary statistics are derived in section 2,5 of this report. It is 
not necessary to calculate these statistics for a cluster with subclusters 
since such a cluster cannot be SPLIT again. 

Three other variables maintained in CLASSY concern the normalization of the 
distributions and the volume integrals which are included in the normalization 
factors. These three are the cluster variables VOLIN, VOLRT, and DCON. The 
overall coefficient for a cluster probability density (excluding its propor- 
tion) is exp(-DC0N/2)/V0LRT, VOLRT * V0LIN 1/2 , thus 

VOLIN * exp (DCON) = (2*)^ det Z 

The splitting of the coefficient into two parts is made necessary by the fact 
that the Z actually used Includes W as a factor, and after a determinant is 
taken, this is raised to the power of the number of channels, which can lead 
to exponent overflow/underflow conditions. The relative division of the 
coefficient between VOLIN and DCON is made in a constant fashion in ADJUST. 

2.4 EQUATIONS FOR THE PROPORTION CALCULATION 

The mean and covariance parameters for a cluster are calculated using direct 
substitution of the left-hand side of eqs. (15) through (18) into the right- 
hand side. 
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However, this procedure is not very sensitive for the proportion equations. 
Because the proportion values enter very strongly into all the equations, 
it was deemed desirable to use a somewhat more complicated but faster con- 
verging system of equations to obtain the proportions. 

It should be noted that there is no single iterative procedure for solving a 
given set of equations; there are an infinite number of systems with the 
same fixed point and differing rates and manners of convergence to this fixed 
point. In particular, direct substitution using eqs. (15)-(18) commonly taken 
to be the maximum -likelihood procedure is only one of many. While equa- 
tions (15)-(18) are the maximum-likelihood equations, many differing techniques 
for solving them can be found in any standard text on numerical analysis; 
e.g., changes of variable, Newton's method, Aitken extrapolation, and methods 
derived from these. It is almost never true that the apparently simplest 
numerical technique is the best. 

Although CLASSY was originally designed to use a sparse matrix variant of 
Newton's method, the only special techniques actually employed are the update 
mode, the extrapolations in ADJUST, a Monte Carlo method used in STATIS, and 
the following modification of the proportion calculation system. 

Substituting eq. (18) into eq. (15) .?nd deleting j-sum parameters and 
circumflexes : 


a = iy a i p i 

a i NZw m 


Z a kP 


k=l 


k H k 



( 22 ) 


where 



(23) 


Now a^ does not depend on j , so the two sides may be canceled, getting 


1 


- 1 V !i 

fils p 


(24) 
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We may now define 


m 


n-rh-E 

1 1 a 1 k=l 

Mi 


l kP k 


so P « ap + (1 - a)q. Transposing eq. (24) and combining, we get 


0 = 


0 - a ,)r-> P, - q 


N 

1 - a 

T 


f 1 - Mj 

^a i p i + 0 - a i )q. 


where 


(25) 


D. = 


Pi - q 1 


i " a i p i + (1 - *.)q. 


Now D.j just changes sign if and q^ are switched, or equivalently, class i 
and not class i. It is the simplest variable in which to represent the pro- 
portion problem. In terms of the direct substitution proportion itera- 
tion is 


a i + a ,- (1 - a <> i?I>i 


As expected, when eq. (24) is satisfied, a' = a. 


(26) 


It can be shown that for completely separated classes (p^ ■ 0 or q^ =0 always) 

the direct substitution is exact In one iteration; that is p. * 0 implies 

-1 1 1 
Dj = j _ fl - and q^ s 0 implies 0^ ■ — . However, for points In between, the 

behavior of this iteration is soft and may coverge slowly, because (as seen in 
the derviatlon of eqs. (24) and (25 ) ) the intrinsic variability which allows 
the system of equations to be solved is in the denominator, a^. + (1 - a^)q^, 
since it contains the only dependence on a^. Because complete separation 
allows the a^ in the denominator to come out, it is possible for eq. (26) to 
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be exactly solved in this case. Mixed points always contribute part or all 
of their weight to holding the "status quo," or delaying the convergence of 
the equations; these points are always entered into the sums using the old 
value of a.j, and the form of the iteration does not take into account their 
changing contribution. 


For example, if we consider a Newton's method iteration (assuming we are 
close to the solution), then we have 


using 



(27) 


This approach substitutes , ■ -? for a ,-0 - a.) in eq. (26). 

k£°i 


This is exact for complete separation, but otherwise Newton's method gives 
better convergence, because the points at p^ = q. do not contribute to the 
denominator at all. 

The method used in CLASSY to calculate proportions is a simple system of thi 
same general type which should not display any major problems. 

By splitting the sum in eq. (25) into parts corresponding to p^ > q^ and 
p^ < q.j and dropping the class index i, we have 

E ° + E d = ° 

p>q D<q 

where every part of the first term is positive, and the second term is nega- 
tive. In line with the cancellation leading to eq. (24), this becomes 



where a' * a implies a solution. Solving for a', 



(recall P = ap + (1 - a)q). In the first form, both denominator terms are 
positive. 

The second form corresponds to the variables used In CLASSY - the numerator 
Is retained In the variable CIN, and the terms 

S P + £ P 

p>q p<q 

used in the denominator are the variable CTOT. The actual proportions 
currently used in CLASSY are the proportion of a cluster relative to Its 
parent cluster. The proportion as actually used Is calculated for each pixel 
in the first loop of STATIS (In update mode), and In ADJUST using eq. (28), 
and Is retained in the class variable PROP(KL). The routine DENCAL (Denomi- 
nator CALculation) Is used to readjust the values of CIN and CTOT when the 
tree Is restructured. This Is required by the use of proportions relative 
to the parent. 

The third form shows the relation of this system to direct substitution, 
eq. (26). The motion Is amplified if there are mixed points, and the 
denominator sums are 0 If p * 0 or q * 0. The relation of this amplification 
and that used by Newton's method (eq. (27)) has not been analyzed. 
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The proportion system in CLASSY could be improved. The use of proportions 
relative to the parent cluster appears to be a design fault, although the 
effects of this choice are fairly pervasive, and the chief benefit achieved 
by making the change to absolute proportions would be a cleaner program. 

The actual proportion calculation system used was a first attempt at achiev- 
ing better convergence than is allowed by direct substitution, and it is sub- 
ject to all the problems of first attempts. It appears at present that a 
Newton's method approach, if combined with a change of variables or a model 
which prevents overshoot (violation of 0 £ a £ 1) would be best. It is pos- 
sible that such a scheme could be cast into a form sufficiently similar to 
the one used here that only the equations for updating CIN and CTOT in STATIS 
would need to be changed. Note that the separation between p > q and p < q 
is somewhat arbitrary; in fact, a fractional separation which appears close to 
eq. (27) and is a function of the current proportions could be used. In any 
such system, the designer, besides making sure that Newton's method did not 
overshoot, must also ensure that the system combines properly with the contin- 
uous changes present in update mode. Such a change should improve the conver- 
gence behavior of the program. In designing such a system, it is useful to 
work in terms of the symmetric variables b = 2a - 1 and r * where 

-1 < b and r < 1 . Then o * r l- BF , and all the formulas pick up a useful 
symmetry under b -*• -b, r -*• -r. The overall proportion system must be invariant 
under this transformation. 

2.5 LINEAR ALGEBRA AND GENERAL LINEAR TRANSFORMATIONS 

The CLASSY algorithm was designed to be invariant under arbitrary linear 
transformations of the brightness space. That is, if we transform all the 
pixels or data points via 

x' = MXj (29) 

where M is an arbitrary fixed nondegenerate matrix, then CLASSY should give 
the corresponding results: same cluster tree, same probabilities for each 


2-11 



point to be In a cluster, same proportions, i 
should be changed by 

U\ * Hu 1 
Z\ * ME^* 

Thus CLASSY Is transparent to general linear coordinate transformations: It 
does not "know" what coordinate system It Is In. 

There are two exceptions to this rule, both In ADJUST: 

a. CLASSY assumes that the Input data have been discretized with Interval 1 
along the coordinate axes. This must be compensated by a Shepherd's cor- 
rection which Is not generally Invariant, since the discretization Is not 
coordinate Invariant. When the algorithm was operated without this cor- 
rection It tended to find point clusters, or clusters with zero covari- 
ance along crystal planes of the discretization lattice, which are, after 
all, the true clusters In the data. 

b. To limit computing time, the calculations used to select candidates for 
JOINIng compare the diagonal elements of the covariance matrices for sim- 
ilarity. This is a noninvariant operation. 

In d channels, the set of nondegenerate matrices form a mathematical group, 
called GL(d). Vectors form Indivisible spaces under the group, loosely called 
representations. Breaking the system down Into Irreducible representations 
greatly simplifies the analysis of the clustering problem, since only certain 
operations and relations are proper. This type of Insight was used through- 
out the design of CLASSY. In particular, the only way to take the product of 
two vectors Is with the Inverse covariance matrix. 

A method of notation for systems of the type which are Invariant under GL 
groups Is called the "summation convention." In it, normal (column) vectors 
are represented by quantities with subscript indices (x^). Vectors In the 
dual space (row vectors) are represented by superscript Indices (not to be 
confused with exponents which almost never occur In this context), as (A 1 ). 


ttc. Means and covariances 


(30) 
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A typical dual vector Is the derivative with respect to a normal vector: 

^ * 3/3Xj. Other objects may have multiple upper or lower Indices, or both. 
If an index appears twice in a term (as a superscript and a subscript), a sum- 
mation over all values of the pair, called a contraction, is automatically 
implied. All the rest of the indices are called free Indices, and must appear 
identically, including upper or lower position, In all terms added together, 
and on both sides of an equation. This notation is extremely convenient for 
use with vector/ tensor systems such as those appearing in CLASSY, and will be 
invoked several places in this report, where it is irreplaceable. Examples 
follow: 
i 1 

T y Mj ordinary transformation matrices 

P r = T j ^ ^ T j the1r matrix P roduct P - MT 

Q I = Tj m! = m! l{ their product P = TM 

r r j j r 

(Note that order does not matter: the order information is contained in the 
index pairing.) 

mJ the trace of M, Tr M 

Tj M'j the trace of TM, Tr(TM) = Tr(MT) 

sj the Kronecker delta, * 1 if i ■ j, otherwise 0 (equivalent to the unit 
matrix; like all matrices must always have one upper and one lower index) 

u’ « J definition of the mean 

l. , ■ i Ex.x . - y.y. definition of the covariance (Note that l is not a 

lj N 1 J 1 J 

matrix, but a rank 2 covariant tensor. No transpose operations appear using 
this notation.) 

(E -1 )^ Its inverse; note the index motion, which is required 
(E*V^j r * <5 J. the definition of the inverse 
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* £ j1 states that 1 * s synmetric {cannot be said of a matrix) 

A ij * * A ji sta t es that A Is skew- symmetric (cannot be said of a matrix) 

x i 3 rj Xj transformation of x by the matrix T 

k i 

r ij 3 T 1 T j E k£ transformation of the symmetric tensor (covariance) by the 
matrix T 

E -1 ^ = (T“^)j*(T'“^)^(E” 1 ) ki ' transformation of the Inverse of £ by T (Note 
the index differences from above.) 


Much of the conventional nomenclature for multivariate statistics becomes 
incorrect In this viewpoint. The covariance Is not a matrix, but a rank 2 
covariance tensor, £. . , and Its inverse Is a rank 2 contravariant tensor, 

(Z ) J . Correspondingly, the use of the "transpose" operation Is invalid; 
It is only necessary in the standard nomenclature because of the attempt to 
represent the covariance as a matrix. Except in this section or as stated 
the standard nomenclature will be used to Improve communication. The above 
notatlonal convention Is well known and fairly standard; the reader may find 
more detailed accounts in any elementary book on group theory, In texts on 
mathematical physics, or in books on multivariate analysis. 


Occasionally In this report, the term "scalar" Is used to refer to quantities 
that do not change under the transformation; e.g., ordinary numbers. 


The measures of skewness and kurtosls used In CLASSY are the traces or con- 
tractions of the complete skewness and kurtosls tensors. The complete ten- 
sors are 


s ijk * i?£ x iYk 
1 


(31) 

j * 

*i in ■ n £ VjVi (32) 

where x * x - p. In 16 channels, S^ has 816 components, and has 3876; 
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they are thus too large to be employed efficiently in a program such as CLASSY, 
and instead their traces are used: 




,-lxks, _ 1 


K ij = K ijkJ^ Z ) = nZ) x i x j x k x J^ Z = nS x i x j x2 

_2 __v 

where x is the distance from the cluster center: x = xEx . 


(33) 

(34) 


The cluster vector SKEW is (W - OW) * , and the array KURT is (W - OW) * K^.. 

(The OW is present because SKEW and KURT are rezeroed each time out of ADJUST.) 


S. transforms as an irreducible vector under GL(d), but K . . can be separated 

* I J 

into two components: 


and 


K = K. j (£' 1 ) ij 
K ij ■ K ij - 1 E ij K 


(35) 


(36) 


Note that K?.(E~ ) ^ = 0, so that this operation is referred to as separating 
* 

K . . into its trace and its traceless components, both of which are incapable 

• J 

of further reduction. K?, has no associated scalar, therefore the lowest 

* J 

order scalar terms available from these contracted third 4 nd fourth moments, 
in addition to K, are: 


S 2 = S.S.(£~V J 

^ J 


(37) 


(K °) 2 = K|jK|, j , (Z " 1 J 11 * (E " 1 ' 

= k 1j .k. .j.O:* 1 ) 11 ^" 1 )^' - y 

These are the simplest forms; the rest are of higher order in both K, S 
and x (such as S^E’ V^K^E - ^^). 


(38) 
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These three quantities are used in ADJUST to test whether a cluster appears 
normal or is a candidate for splitting. Currently all three are used; but if 
the algorithm were modified to allow nonnormal and skewed clusters, then only 
(K°) gives meaningful information about split clusters. 

In ADJUST, TRK represents K (and later [K - d(d + 2)]/ / W - OW ) , normalizing 
to the expected random variation). SK represents DW * S and URK represents 
DW * (K°) . SK, URK, and the later TRK are all normalized so that random fluc- 
tuations have a size independent of DW. 

The three components each have a specific meaning: 

_2 

a. S is a vector indicating how far and in which direction the base (hiqh x ) 

_2 

of the distribution is shifted from the peak (low x ). 

b. K has the value d(d + 2) for a normal distribution. Larger K's represent 
distributions which are more pointed than a normal distribution indepen- 
dent of direction (lepto-kurtosis) ; smaller K's represent distributions 
which are less pointed and with smaller "tails" (endo-kurtosis) . 

c. The third tensor, K°, represents the tendency for points in each set of 
opposite directions to be at a different distance from the center compared 
to some of other pair of directions. In other words, K° measures how 
"lumpy" the distribution is when observed on a sphere at some fixed dis- 
tance from the center. Since this lumpiness is characteristic of multi- 
modal distributions in several dimensions, K° is really the best measure 
of multimodality used in CLASSY. The other tests have been included to 
maintain consistency with the formal description of the program as fit- 
ting the given distribution with a mixture of normals. Also, unless other 
tests or a precise formal model were used, these tests could ultimately 
mask multimodal situations if ignored. 

2.6 EQUATIONS FOR A MIXTURE OF TWO DISTRIBUTIONS 

The following formulae are the first four moments of a general mixture of two 
normal distributions; essential use is made of them in the routines JOIN and 
SPLIT. The summation convention is used. 
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2 

Consider a normal distribution in d variables with covariance c , in two cases: 
(1) total weight 1, mean 0; and (2) total weight a, mean y. In the following 
table, the portion in single quotes gives the basic character of the term, 
which must then be symmetrized on indices. 


Weight 1 , 

Moment mean 0 

0 1 

1 0 


3 0 


'* 4 ' ' « 

+ a ik°jH + a U a jk 


Weight a, 
mean u 


a 

ay 

a (y .y . + o. . | 

\n J u I 

■ap(u 2 + 3a 2 ) 1 = a(u 1 u j M k ♦ u,aj k 

2 2 \ 

+ Yik + Vij) 


'a(u^ + 6 u 2 o 2 + 3o^)' = a ( u i u j u k u e. 

+ + u i u k°i£ + 

+ “jVu + •‘j'V’lk 

2 2 2 2 2 2 2 \ 
+ Vk a ij + °i.i 0 kJi + a ik a jH + a U a jk) 


A mixture of two such distributions will be used. 



Cluster 1 

Cluster 2 

Remark 

Weight 

a 

b 

a + b = 1 ; let 

Mean 

y 

V 

Let 6 = y - v 

Covariance 

2 

a 

2 

T 

2 2 

Let D = a - 


2 

These are viewed as a single distribution, with weight 1, mean y, covariance z 
(inverse E“^), skewness S, and kurtosls K (with the 3E 4 term also subtracted), 
as shown in table I. 
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TABLE I. MOMENTS OF THE MIXTURE OF TWO NORMAL DISTRIBUTIONS 


Moment 


Remark 


1 * a + b 
u * aw ♦ bv 

1 £ s abS 2 + aa 2 + bx 2, 

Z ij = ab Vj + aq ij + bT ?j 
s, - (r' 2 ) k)l s. La 


_ ,_-2,k£ /l_ ? 2 2 

■ ( } U Zx i x k x £ ’ u 1 z k£ " V*U ■ U £ Z 1k ’ 

* 'ab(Z’ 2 )<5(3D 2 - C5 2 )' 

- 4 Vk 5 *) 

K 1j ■ < E ’ 2 >%M 

■ ' 1 " 2 (s** - 4 “ s ( 3 ) - 6E 2 h 2 - „ 4 - 3Z 4 ) ' 

' U 2,ki (ff I:X 1 X j X k X t ‘ u f S jki * “j S lkt - “k S tjt ’ "tS, jk 

• U 1 U j E kt - - Wik - Kjl'k'tt ' “jVfk - Vt E fj 

• •H'-jvt - e u 4 - i ?k i j t - ^J k ) 

'E 2 (3abD* + ab(l - ab)6 4 + 6cab6 2 D 2 )' 

• 44 *(Vj & * Vk°jl ♦ wjk ♦ « lS|t 0 2 t * 


Using the definition of c. 
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3. DESCRIPTION OF SUBROUTINES 


Most subroutines used in CLASSY have simple, nonmathematical descriptions ; 
they do simple bookkeeping, structural manipulations, or simple functions such 
as printout or matrix algebra. These routines are not described in this section 

The following subroutines which have mathematical properties and mathematical 
descriptions are described in this section: 

STATIS (Statistics) 

ADJUST (Adjustment) 

JOIN (Combines) 

SPLIT (Splitting) 

DENCAL (Denominator calculation) 

APRIOR (A priori distribution) 

ISPLIT (Is split) 

EIGROT (Eigenrotation) 

STATIS and ADJUST are the subroutines which control the processing in CLASSY. 
STATIS handles all the incremental statistics, essentially doing all the accu- 
mulation required by eqs. ( 15)- (18) . STATIS also contains the code generating 
the DO loop over the data points (which could have been placed outside it). 
ADJUST is called by STATIS for each cluster on a specified basis (if either of 
two given thresholds are exceeded). ADJUST in turn does all action on a clus- 
ter which is done on a lumped basis: making tests for split clusters, sep- 
arable clusters, joinable clusters, etc. ADJUST is also in charge of all 
extrapolation of continuous parameters, subtracting old data from the sums of 
eqs. (15)-(18) so that the system can update or iterate properly, not depend- 
ing on bad data values from earlier data or iterations. In general, ADJUST 
handles any operation that is not executed every time a point is entered into 
a sum and is in charge of testing for and calling all the tree-restructuring 
operations. The structure of STATIS is fairly well dictated by the mathematics, 
while that of ADJUST is largely heuristic. 
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3.1 STATIS 


STATIS, besides generating the pixels and triggering ADJUST, calculates the 
quantities in eqs. ( 1 5)- ( 18) , together with a few other statistics such as 
skewness, kurtosis, and likelihood ratio which are used in the structural 
change tests in ADJUST. STATIS also contains the logic which modifies the 
calculations of these statistics, depending on whether a given cluster is in 
update or iterate mode. The decision as to which mode a cluster should be in 
is made in ADJUST. 


STATIS is divided into two blocks, or loops, over the cluster tree. The first 
of these calculates the class conditional and posterior probabilities for each 
class, and the second updates all the statistical sums. The first loop consumes 
most of the execution time of CLASSY, si^ce all clusters must be processed 
through each pixel, and for each a quadratic form must be evaluated. This loop 
starts at FORTRAN statement 31 and continues to just before statement 150 
(see section 4.1). For each point, the first pass defines for each cluster KL: 

PCUM(KL) = (Probability acCUMulator) 

= X) a.p^x) 

ieKL 1 1 



where ieKL indicates i is a subcluster of KL, and PCUM is later normalized 
by PRIRCM. 

PRIRCM(KL) = (PRIORs acCUMulator) ) 


- £ a j 

ieKL 1 

PROP(KL) = (PROPortion) 

* CIN(KL)/[Wp - CTOT(KL)] 
where Wp is the weight of the parent 

PCOND(KL) = (CONDitional Probability) 

] _ / V2(x-u KL ) z Kl 1 (x-w KL ) 


* P KL (x) * 


(2n) d/2 |z KL | 


(40) 


(41) 


(42) 
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A selection is made whether to use PCUM, the sum of the subcluster probabil- 
ities, or PCOND as the output probability for this cluster (PST). This is made 
in a continuous fashion, 

PST(KL) - « WOMOW.+ e"™ PCUM(KL) (43 ) 

1 + e 

essentially selecting the cluster or its subclusters in proportion to the 
likelihood ratio between them. For SPFAC sufficiently large or small, PST(KL) 
is clamped to PCUM or PCOND, respectively, with thresholds XOVFLO and XUNFLO. 
PST(KL) = (P STored) is the final output for cluster KL. 

The subroutine CORECT has been used to calculate x - p for each cluster, stor- 
ing the result in REL (RELative pixel), and the cluster variable DISS contains 
the value of the quadratic form (x - y)^l"^(x - u) including DCON. When this 
form is too large, the cluster is not processed for this point. 

The second loop of STATIS updates all the statistics being accumulated for the 
cluster: direct statistics, skewness (SKEW) and kurtosis (KURT), the log 

likelihood ratios between a parent cluster and its subclusters (SPFAC), and 
of the probability difference between the parent and its subclusters (PQRAT). 
This latter variable is used to determine if the parent cluster and its sub- 
clusters have come to be practically the same distribution. The temporary ZQ 
used here is r « B -y ^ as described in section 2.4, and a short approximation 
to the log is used in computing SPFAC. 

All the basic variables updated (W, SUM, VRIN, CIN, CTOT, SKEW, KURT, SPFAC, 
and PQRAT) were described in section 2. 

Three dependent variables are also maintained by STATIS: VOLIN, VOLRT, and 
DCON, described in section 2.3; with VRIN, these have special updating require- 
ments which must be described. 
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VRIN is the inverse of the covariance a- • , which is not directly maintained. 

' J 



where - y.. or 

a =J]x x T 

using matrix notation. Uodating a 

a' = a + x x t 


Then 


VRIN' = (a + x xV 1 

= a~^ - a^x" x*a"1 + a~^x x^a^x x^a"^ - x ••• 

= VRIN - (a‘ 1 x)(x t a “ 1 )(l - xV 1 ! + (xV 1 !) 2 ... 


= VRIN 


(a~ 1 x)(a~ 1 x) t 
1 + x^ -1 ^ 


( 44 ) 


which may be verified by direct multiplication. Note that x^a’^x is the 
exponential argument from the multivariate normal distribution. 


Similarly, if V = det(a), the determinant of a, then 


V ' = det (a + x x t ) 

= det(a) (1 + xVx) 
= V(1 + xV ] x) 


( 45 ) 


This may be obtained by expanding x x* terms by minors, and using the defini- 
tion of the inverse in terms of cofactors. VOLIN is the same as det(a) up to 
a constant factor, and is updated by the same formula. 


VOLRT is updated by one cycle of Newton's method to follow (VOLIN)^ 2 , and 
DCON is updated (using an approximation to the log), to compensate for the 
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factor of W**AMQ which appears in VOLIN and affects VOLRT. This factor appears 
because the terms in the covariance are added before the determinant is taken* 
but without dividing by the total weight to get the actual covariance. This is 
unavoidable and is easily handled using the DCON variable. The formulas act- 
ually used in the program are generalizations to the case with variable weights. 

After updating all the statistics, STATIS checks to see if a variable is over 
threshold (W _> WADO or NPTSO > IDADJ) and saves the index of at most one clus- 
ter for passing to ADJUST after the current data point has been fully processed. 
The delay is necessary to avoid overlapping usages of certain EQUIVALENCE clus- 
ter variables and to otherwise ensure that processing is complete in spite of 
the restucturings. 

In iterate mode, everything but VRIN, VOLIN, VOLRT, and DCON is updated simi- 
larly; PROP is not updated from CIN and CTOT in the first loop in this case, 
and OSUM is used rather than SUM in calculating the mean. In addition, OVAR, 
which in this case represents the current rather than the old values of the 
variance, is updated directly. The vector COVEC is g (a x), ALOW = 

ALPHA = ~rP , and 


COFI = 


-WP 


W' 


1 + p (x V Vl 

1 K W 


where 


W = the old weight 

W' = the new weight for the cluster 

COFI = the coefficient used to update VRIN 

Although eqs. (15)-(18) indicate that every point is added to every cluster, 
certain shortcuts were taken in CLASSY to speed processing. Normally, each 
data point will have fairly large probabilities for one or a few clusters. The 
remainder will be far out of range, with their probabilities damped by a large 
exponential argument. These terms could be deleted without causing any problem 
and would usually be effectively eliminated by the machine's floating-point 
hardware. Classy handles this situation in STATIS at the Monte Carlo loop 
starting at statement 132. 
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To avoid introducing a bias* a small number of the low-probability points are 
given a correspondingly increased probability and processed normally. The 
procedure follows: first, any point with probability greater than the thres- 
hold parameter PLIM (defined in CBLO) is automatically processed. For the 
remainder, an integer from 0 through MONTE-1 is selected randomly by the ran- 
dom number routine DISC. If that number is a specified value (1), then the 
probability is multiplied by AMONTE. (AM0NTE=M0NTE is also a parameter. ) 

The program then returns to the PLIM threshold check and continues to loop 
until P > PLIM or the random integer misses the specified value. If the ran- 
dom integer ever misses, then the current point is not processed through this 
cluster. Thus, if the probability for a given point in a given cluster falls 

below PLIM, the probability has a chance to be multiplied by MONTE and 

. MONTE-1 TIL , ....'1 


considered further, and a 


chance to be Ignored. The bias which easily 


MONTE 

crops up in a tail -truncation procedure is thus eliminated. 


Presently, the values of PLIM and MONTE (AMONTE) are fixed throughout a run. 
If the second half of STATIS should ever consume excessive time, a modifica- 
tion could allow a large value of PLIM during early processing to be followed 
by a drop to a small value for the last few passes. 


3.2 ADJUST 

ADJUST is entered periodically to adjust a cluster via extrapolation of data, 
and elimination of old data from the continuous statistical parameters of a 
cluster, and to make the tests required to decide on discrete transformations 
of the cluster tree. Most of the separate operations occurring In ADJUST 
are unrelated. ADJUST also gets and frees storage for temporary matrices used 
by itself and by the discrete transformation subroutines it calls, particu- 
larly SPLIT. 


Before returning, ADJUST sets the old values of all the statistics to their 
current values and calculates the thresholds WADJ (Weight ADJust) and IDADJ 
(ID ADJust) for the next call to ADJUST. WADJ is set to a quantity which 
exceeds W by an amount which is the increase allowed in W before the next 
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ADJUST. This increase is currently a fixed parameter DWFA.C > 1 times W f but 
could be set to a value dependent on the stability of the cluster. The IDADJ 
threshold is included so that data is not double-counted; that is, to ensure 
that every cluster is ADJUSTed at least once for every pass through the data. 
ADJUST sets IDADJ to the current point number (NPTSO) plus the total number 
of pixels in the data set (TOTPIX). Before returning, ADJUST also determines 
the mode of the cluster (update or Iterate), depending on whether ADJUST was 
entered due to WADJ or IDADJ, respectively. 

In processing the continuous statistics, ADJUST first subcontracts the old 
values of the accumulated statistics from the current values to ensure that 
no data older than the previous call to ADJUST (or the cluster creation) is 
included in the new statistics. In addition, the motion of the parameters 
since the last call to ADJUST is calculated, and the new parameters are set 
to overshoot their current, subtracted values by this motion times certain 
acceleration factors. 

The acceleration factors (currently 0) are a function of whether the cluster 
was in update or iterate mode, and are in the arrays PACCEL (Proportion 
Acceleration), VACCEL (mean (Vector) Acceleration), and MACCEL (covariance 
(Matrix) Acceleration). These are indexed by the internal temporary KADTY 
(KADTY = 1 for update mode, = 2 for iterate mode). These extrapolations are 
done using the temporary EXF, which contains various weight factors via the 
temporary WINFC. The scheme is an ordinary extrapolation or accelerated con- 
vergence scheme for a set of equations in a set of variables, and will not 
be discussed further since presently the parameters are zero. The CTOT's of 
the subclusters and sibling clusters must be modified during the updating, 
due to the use of relative proportions. 

The discrete changes in the cluster tree are made whenever a cluster being 
adjusted passes a test for some particular change. There are five such cluster 
tree transformations; they are listed in table II with the routine makinq the 
transformation, an abstract of the test used, and parameters from common, upon 
which each test depends. Parameters used in the WADJ calculations are also 
included. 
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The statistics and tests used to determine if a trial SPLIT is to be made are 
described in sections 1.1 and 2.5. The likelihood ratio tests and PQRAT test 
which govern the calls to SEPER, SUBLIM, and ELIM are also discussed in sec- 
tions 1.1, 2.3, and 2.5. 


The test to determine whether a trial join is advisable is based on a heuris- 
tic criterion which compares the mean vectors for two clusters and the diag- 
onal elements of the covariance matrices. This criterion is given by 



where 

W. = current weight for cluster i 

A and B = arbitrary constants (currently, A ■ 0.3 and B * 0.18) 

The first term in the numerator is the distance between the mean vectors of 
clusters i and j, weighted by an average computed from the inverse covariance 
matrix fcr clusters i and j. The second term in the numerator is a measure of 
the difference in the diagonal elements of the two covariance matrices. The 
diagonal elements rather than the full covariance matrices are used for compu- 
tational simplicity. A more complete expression involving all covariance terms 
is Hn det EjEjJ 1 . The denominator is designed to discriminate against small 
clusters in the sense that R.. will be artifically reduced if the weight of 

• J 

one cluster is small relative to the weight of the other cluster. This factor 
is designed to give large clusters an opportunity to absorb small clusters if 
such a join does not substantially affect the statistics of the larger cluster. 

The Ry criterion is computed for certain clusters having the same parent as 
cluster i; the clusters to be checked are selected on a Monte Carlo basis. 
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If the cluster j for which is a minimum is less than a fixed threshold, 
the hypothesis is raised so that clusters i and j are really the same cluster. 
In practice, a new cluster at the next higher level in the cluster tree is 
created, with parameters and other statistics obtained by combining the values 
for the two similar clusters. This is accomplished by calling the JOIN sub- 
routine. 
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TABLE II.- CLUSTER TREE TRANSFORMATION ROUTINES 


Transformation 

Routine 

Test 

Parameters 

Generate two subclusters 

SPLIT 

Trace kurtosis, skewness, or trace- 
less part of kurtosis too large. 
Must not yet have subclusters 

TRBND, SKBND, URKBND, 
TRCHI , SKCHI, URKCHI , 
WAIT 

Eliminate this cluster in 
favor of its subclusters 

SEPER 

Likelihood ratio strongly favors 
subclusters 

SEPTH 

Eliminate all subclusters 
of this cluster 

SUBLIM 

Likelihood ratio strongly favors 
parent clusters, or likelihood 
ratio is mediocre and subclusters 
very similar to parent cluster 
(using PQRAT) 

SBLTH , PQRATH, SPMVTH 

Eliminate this cluster and 
any subclusters 

ELIM 

This cluster proportion too small 
(and NOELIM switch is off) 

ELIMTH, NOELIM 

Make this cluster and a 
sibling cluster subclusters 
of a new cluster 

JOIN 

Sibling most similar to this in 
mean and covariance diagonal 
elements is sufficiently similar. 
Siblings to be checked are 
selected on a Monte Carlo basis. 
(Procedure is quite heuristic) 

WDJOIN, PJOIN, VRJOIN, 
RLIM 

Next adjustment point 

WADJ 


DWFAC, WSIM, WDELSM 



3.3 JOIN 

The subroutine JOIN takes two clusters which are subclusters of the same 
parent and combines them into one cluster, which is a subcluster of the orig- 
inal parent and has as subclusters the two original clusters. Aside from the 
bookkeeping necessary to modify the cluster tree, it must also calculate the 
statistics of the new cluster. 

The weight and adjustment threshold of the new cluster are defined by param- 
eters. Its likelihood ratio is given the APRIOR value, and PQRAT, SKEW, and 
KURT are zeroed. Its proportion is the sum of the proportions cf its com- 
ponents, with the numerator CIN having the weighted average of the subcluster 
CIN's. DENCAL is called to give the subclusters the same relative weights 
they had previously, but relative to the new parent cluster. 

2 

The mean and covariance of the new cluster are y and £ from table I, calcu- 
lated directly in the DO loops ending at FORTRAN statement 21. Referring to 
table I, the relevant variables have the values 

CA = a 

CB = b (new statistics - update mode) 

CBV = b (old statistics - iterate mode) 

(Two values are necessary because of the different handling of some 
variables . ) 

CF = temporary (contains ab) 

FA = temporary for weight adjustments 

DELTA = atxS 

Wg = weight of the second subcluster 

Wj = weight of the new cluster 

The new statistics are stored in the standard form, and the mandatory calcu- 
lations for VOLIN, VOLRT, and DCON are made, along with the mandatory storage 
of "old" statistical values. 
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3.4 SPLIT 


SPLIT is a subroutine which guesses the optimal axis on which to split an 
existing cluster. SPLIT is called after ADJUST ascertains that the cluster 
should be split; SPLIT has available the skewness and kurtosis data for the 
cluster as additional information to use in finding the proper split. After 
SPLIT has calculated the proper statistics for the component clusters, it 
builds the clusters and links them into the cluster tree. The parent cluster 
is not changed in any way, but the new clusters are linked to it as subclus- 
ters. SPLIT does not actually split a cluster, but determines the best way 
to consider splitting it. The actual decision to split is made on the basis 
of likelihood ratio information. If final splitting is required, it is car- 
ried out by the routine SEPER. The two new clusters formed by SPLIT are cre- 
ated with small values of W and WADJ to allow them to move rapidly to fit the 
actual distribution, and to be ADJUSTed quickly. The new clusters are con- 
sidered to be guesses, and are treated accordingly. 

The equations to be solved by SPLIT are those for the mixture of two distri- 
butions (table I). SPLIT first puts everything in a frame with an overall 
mean M of zero, and where the overall covariance is the unit matrix. Under 
this convention, rank 2 tensors may be called matrices, or reference may be 
made to the inner product of two vectors, etc. Whenever such a usage is made, 
the covariance or its inverse intervenes; e.g., 6*5 for vector 5 really means 
Such requirements can always be tracked by the index summation 
convention, as the only way an upper index may be converted to a lower index 
or vice versa is with the covariance or its inverse. It is easiest to think 
of the covariance as being a unit matrix, in which case the group GL ( d ) of 
general linear transformations on spectral space is restricted to its subgroup 
0(d) of orthogonal transformations , which leave the covariance unchanged. 

In order to define the two new clusters, SPLIT must find two new covariance 

matrices, the difference between the two new cluster means, and the difference 

7 7 

between the two new proportions for a total of 2(d + d)/2 + d + 1 = (d + 1) 
variables. Covariance matrices are represented by their square roots, the 
"standard deviations," to preserve positive difiniteness. 
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The overall statistics of this combined distribution must match those of the 

given cluster, including the skewness and kurtosis. The mean and proportion 

of the given cluster are taken care of automatically, so there are d(d + l)/2 

equations each for the covariance (matching it to 1) and the kurtosis, and d 

2 

equations for the skewness, for a total of d + 2d equations, or one more 
unknown than there are equations. 


The method used to solve these equations in the current version of SPLIT is 
rather crude, and could be greatly improved. After a first approximation is 
made, the subroutine tries to minimize a quadratic form of the squared differ- 
ence between the three classes of statistics and their values for the current 
assignments of the independent variables, by using a steepest-descent type of 
algorithm. The quadratic form to be minimized is thus 


OBCOV || Z - e|| 2 + OBSKEW ||S - S|| 2 + OBKURT ||K - K|| 2 (47) 

where the circumflexed values refer to those calculated from the current val- 
ues of the independent variables using the equations from table I, and the 
unmarked values are those derived from the statistics of the cluster to be 
SPLIT. When K is referred to in SPLIT, the value is used with the normal 
distribution offset of [(2 + d) z] subtracted; this offset is the trace on 

4 

one pair of indices of 1 3Z '. Using the summation convention 


3Z 


4, 


= v r 

Sj kl 


Z ik Z j£ 


z z 

Ujk 


( '3Z 4, )Z* 


lij = }: 


j£ + + V = (d + 2) V 


OBCOV, OBSKEW, and OBKURT are arbitrary objective function coefficients 
defined in common. Written out, this objective function is: 

obj = OBCOV (gfi + ^- C - a 2 + ~~ r 2 - I } 2 

+ OBSKEW [g( 6TrD 2 + 2D 2 6 - c6 2 6) - S] 2 
+ OBKURTj 2gD 2 TrD 2 + g(D 2 ) 2 + g^^- 1 -) <5<S * 

- cg[<S 2 D 2 + (TrD 2 )6<5 t + 26 ( D 2 6 ) 1 + 2(D 2 6)5 t ] - kJ 2 (48) 
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2 

where c, t, D , c» and 6 are parameters of the two clusters as defined in 
table I and in section 2.6; g is (1 - c )/4, S is skewness, and K is sub- 
tracted kurtosis. The above expression is written In matrix notation in a 
coordinate frame where £ is the unit matrix (note the appearance of the unit 

matrix in the covariance term), and all matrix products, traces, vector inner 

2 t 

products, etc., are taken using £ as the metric. 5 is a scalar * 6 6 and 
66 t is a matrix. 

As a steepest descent technique is used, the derivatives of the objective 
function with respect to the independent variables are required. Note that 
the derivative with respect to a vector is a vector, and that the derivative 

with respect to a matrix is a matrix. Also, the notation (A,B) - AB + BA for 

matrices A and B is used to denote the anti commutator calculated by routine 
ACOM. The error terms in the objective function are written: 

E = £- £ = £- I; T = S - S; V = h' - K; E and V are matrices and T is a vector. 

For brevity, A for OBCOV, B for OBSKEW, and C for OBKURT are written. 

For the derivative, 

j = A [- | 5 t E6 + Tr(ED 2 )] 

A 

+ BT(- 1 1 - a6 2 6 ) 

+ cj- | [2Tr(VD 2 )TrD 2 + Tr(VD 4 )] 

+ (3cg - | b^6 t V5 + (| ? - gj[6 2 Tr(VD 2 ) 

+ 5 t V6TrD 2 + 4(V5 ) t (D 2 5 )j| (49) 

where 

[VS ) t (D 2 6 ) = 26*^, D 2 }6 (50) 
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2AgES + gB[-2cT(S5 + (TrD 2 - c<$ 2 + 2D 2 )T] 

+ 2gC[(h - 2cTrD 2 )VS - cTr(VD 2 )6 - 2c{V,D 2 }<$] (51) 

where 


h = 



- 1 


2 


If we write 


then 


\ = A | E + gB(T6 1 + T6 t + ST fc ) 

+ gcf(2TrD 2 - c* 2 ) V + (2TrD 2 - csSs )I 


+ fV,D 2 } - 2c[(V<5)6 t + 6(V6) t 


]l 


(52) 


3obj 

Ba 


Bobj 

cT 


■ i -^1 


(53) 


The body of SPLIT is taken up with calculating the above objective function 
and its derivative. To handle the shortage of one available equation, the 
system is allowed to approach a minimum naturally, but the objective function 
is multiplied by (1 + c • GAMCEN)*, which tends to make the weights of tne 
two clusters equal. 


Internal variable names, temporary variables, and accumulators associated 
with variables used in this part of SPLIT are listed in table III. 


*GAMCEN is a control parameter defined in a BLOCK DATA subroutine. 
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The actual steepest-decents procedure is not complicated. There is a certain 
ambiguity in any steepest-descents procedure in that the derivative vector is 
of a different type from the dX needed as an increment. This is an example 
of the mathematics elaborated in section 2.5; essentially, any steepest- 
descents procedure yields different results in different coordinate systems. 

In the case of the SPLIT routine, there are natural coordinate systems within 
the vectors and within the matrices (due to the transformation of E to 1), so 
the only ambiguity concerns the relative sizes of the scalar, vector, and 
matrix terms. These are handled by means of control parameters GAMMET, DELMET, 
and SGTMET, so that the square of the full derivative is 


GRADSQ = GAMMET 


(iofel ) 2 + DELHET .(=$) 


+ SGTMET 


/ 3obj\ 2 

+ (m.) 

\ *°l 

\ ^ / 


(54) 


In the steepest-descents procedure, SPLIT first calculates the objective func- 
tion at its new test point. If this is an improvement over the last test 
point's objective function, the derivatives are calculated and a steepest- 
descent step is taken to a new point using these derivatives. (This is always 
done at the first point.) If the new point is not an improvement, the deriv- 
ative is not calculated, and the step size is made smaller than the step just 
made, and points backward along the old step. Conceptually, the new point is 
rejected in this case, and the program tries a new point in the same direction 
but closer to the previous origin point. 


The step size is controlled by the variable SSIZ, using the temporary SHRINK. 
If the new point is an improvement (which is the change ratio of SSIZ), the 
step size is increased by an amount dependent on the expected and actual 
changes in the objective function, and bounded below by /ExmNsq and above 
by EXMAX, where EXMNSQ and EXMAX are control parameters. If the test point is 
not an improvement, the new point is taken at the minimum of the parabola 
defined by the test and old objective function values, and its derivative at 
the old point. 
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The iteration is terminated by a system using the variables PCTIMP, THIMP, 
and DOBFAC, and the control parameters DAMP, TIMQ, and TIM1. Briefly, PCTIMP 
is a running average estimate of the fractional improvement in the objective 
function per iteration, where DAMP controls the length of the running average. 
THIMP is the THeoretical IMProvement for the step (not greater than the new 
objective function). After each iteration, the average improvement, PCTIMP*OBJ 
is compared with DOBPMS (Derivative of OBjective Per Millisecond), and if it 
falls short of the required value, iteration is terminated. DOBPMS is calcu- 
lated early in the program from the timing factors TIMO and TIM1 and from the 
number of channels, since the cost of each iteration depends on the number of 
channels. The purpose of this procedure is to balance off computer time 
against the value of a better solution to SPLIT. There is also a direct limit 
to the number of iterations using the control parameter ITERMX. 

After the iteration is complete, SPLIT rotates the solution back to the orig- 
inal coordinate system and builds the two new clusters. It was found necessary 
to enlarge the covariances of the two output subclusters to enable them to find 
the true distributions they were intended to match. This spreading of the 
clusters is necessary due to the "guess" character of SPLIT; otherwise, the 
generated clusters would be so far off the actual clusters they were intended 
to model that they would get no points, and would be eliminated eventually as 
having too small a proportion. Thus SPLIT adds to the covariance of each 
cluster an amount SPRED (z. . + 0.25.5 .), where SPRED is a control parameter. 

" vJ ' J 

The initialization of the steepest-descent procedure is based on a crude solu- 
tion to the splitting problem. The system is transformed to a coordinate 
frame where the overall covariance Z is the unit matrix. This is done by per- 
forming an eigen decomposition on z to diagonalize it, and then stretching or 
shrinking along each of the coordinate axes by the square root of the neces- 
sarily positive eigenvalues to make them 1. A rotation can diagonalize the 
kurtosis, leaving a unit covariance and a diagonal kurtosis. SPLIT uses this 
kurtosis-diagonal frame of reference. 
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For initialization, C is set to zero (actually 10’^) and the skewness is dis- 
regarded to first order. The direction of splitting is along the most negative 
eigenvalue of KURT. (If there is none, the splitting is in covariance rather 
than in mean, e.g., the cluster has a sharp peak and a broad base.) The length 
of the vector in this direction is obtained by solving a cubic equation using 
both the skewness and kurtosis along that direction; the skewness equation is 

p 

necessary to fix D in the given direction. The remaining components of the 
displacement vector are calculated from the skewness components, and the co- 
variance diagonals from them. The detailed equations used for the initiali- 
zation do not appear to be available at present other than in the code itself; 
however, they were derived from table I directly as described. Some variables 
used in the initialization and their meanings are given in table IV. 

SPLIT is a very crude routine, and could be much improved. It may also be a 
little slow for 16 channels, since many of the operations involved are cubic 
in the number of channels. This cubic behavior multiplied by the number of 
iterations can potentially be fairly expensive even though SPLIT is typically 
called only a few times during the execution of CLASSY. This can be regulated 
by the control parameters TIMO and TIM1 . 

Since SPLIT only generates a guess, it is possible that a much wider solution 
will suffice. In fact, the initial guess used by SPLIT may be adequate, which 
could be tested by making comparison runs with ITERMX = -1. Tests with other 
small values might be profitable as well. 

In any event, it should be possible to solve the equations used in SPLIT by a 
more ordinary, direct approach, rather than the somewhat roundabout steepest- 
descents method actually employed. The current version used this method only 
to make the coding direct; an earlier version of SPLIT was written using a more 
direct method of solution, but was judged too difficult to debug. It was 
essentially an extension of the initialization method used in the current 
SPLIT. 
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SPLIT currently divides a cluster into two components. However, analysis indi 
cates that if the kurtosis has more than one negative eigenvalue, then the 
distribution, if made of normals, must have three or more components. No 
recognizance of this is made in the current code. Any later version should 
address this, or at least gather statistics to determine if the case is 
important. 
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TABLE III- VARIABLES AND ACCUMULATORS USED IN SPLIT 


FORTRAN variable 

Meaning 

GAM 

c 

GP 

(1 +.0/2 

GM 

(1 - 0/2 

AA 

„ . 1 - c 2 
9 4 

BB 

. 3c 2 - 1 

h ' 2 

TRD 

Tr(D 2 ) 

DELSQ 

6 2 

R 

d 2 s 

DUM 

D^ (temporary) 

TMG 

Tr(D 2 ) - c5 2 

BBP 

h6 2 - cTr(D 2 ) 

GAM2 

2c 

GAMDEL 

c6 2 

ERCOV 

E 2 

ERSKEW 

T 2 

ERKURT 

V 2 

OBJ 

objective « ERCOV*OBCOV + ERSKEW*OBSKEW 
+ ERKURT*OBKURT 

SPROA 

(TrD 2 )6 + 2D 2 6 - c5 2 5 

DEL 

6 

SG 

a 

TAU 

T 

ERE 

2 

a (temporary) 

VER 

2 

t (temporary) 

T 

T 
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TABLE III.- Continued. 


FORTRAN variable 

Meaning 

ERE 

E (temporary) 

VER 

V (temporary) 

GMCF 

1 + c 2 *GAMCEN 

DKURT 

g*OBKURT 

DKRTGM 

cg*OBKURT 

DSKEW 

g*OBSKEW 

D05 

-2cg * OBKURT 

ERED 

E5 

DSQT 

D 2 T 

VDEL 

V6 

DUM 

2 

{V, 0 } (temporary) 

VDSQD 

(V, D 2 } 6 

TDEL 

J l 6 

DVDEL 

5 t V6 

TSPROA 

T * SPROA 

TVDSQ2 

Tr(VD 2 ) 

TPVD 

gT * OBSKEW - 2cgV6 * OBKURT 

0C0V2 

2q OBCOV 

02 

2g [h 5 t V 5 - | Tr(VD 2 )] OBKURT 


- 2cg T<5 OBSKEW 

03 

g(TrD 2 - c. 1 ! 2 ) OBSKEW 

D5 

2g(h3 2 - cTrD 2 ) OBKURT 

06 

-4gc OBKURT 

SGI 

OBCOV 

TAUl 

- OBCOV 

UNIDSQ 

gT^ OBSKEW + gQ Tr(VD 2 ) - c5 V ] OBKURT 
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TABLE III- Concluded. 


FORTRAN variable 

Meaning 

OD3 

g(TrD 2 - cS 2 ) OBKURT 

DERED 

5 t E5 

DVD202 

6 t {V, D 2 }6 

DDEL 

2 obj/3 5 

TEREDQ 

Tr(ED 2 ) 

TR2VD4 

2Tr(VD 4 ) 

VER 

3obj/3a (temporary, note square) 

ERE 

2 

3obj/3t (temporary, note square) 

DSG 

aobj/ocr 

DTAU 

3obj/3T 

DGAM 

3obj/3c 

SUMV 

( 3obj/36 ) 2 

SIMM 

(3obj/3a) 2 + (3obj/3x) 2 

GRADSQ 

(square of total derivative - see text) 

GRADRT 

/GRADSQ 
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TABLE IV - INITIALIZATION VARIABLES USED IN SPLIT 


Variable 

Meaning 

EVURT 

Kurtosis eigenvalues 

IBES 

Index of most negative eigenvalue 

AMXVAL 

Most negative eigenvalue 

TRN 

Tr(D 2 ) 

TRSQ 

TRN**2 

RT 

Essentially /32 tVURt 

RTSM 

Z RT + illMii . TRN \ 

\ 3 6 IBES / 

TCOF 

d + 4 + (1/3) 

ORTSM 

j: ~r - (~j) where In RTSM, TCOF, and 
ORTSM the term In parentheses appears 
only If there Is a most negative 
eigenvalue 

FRT 

^ y AMXVAL 

DELIN 

The displacement along the most negative 
eigenvalue 

DBES 

2 

D along most negative eigenvalue 

ERT 

A new approximation to RT 

BTR 

Temporary coefficient 

DELFAC 

Temporary coefficient 
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3.5 DENCAL 


This routine adjusts CIN and CTOT for a cluster to change the proportion for 
that cluster by a given ratio. This adjustment Is necessary because of the 
notatlonal convention that makes a cluster's proportion the product of Its 
parent cluster's proportion with Its proportion as calculated from Its own 
proportion system (CIN and CTOT). If the proportion system design were changed 
to work with absolute proportions, OENCAL would be unnecessary. DENCAL also 
makes the mandatory calculations of the dependent and old variables DROP, OPROP, 
and ODEN. 

The proportion of a cluster relative to Its parent Is calculated via 

PR0P • nfrrm 

where Wp is the weight of the parent cluster. This form is necessary to keep 
the proportions correct even if a cluster is skipped via the Monte Carlo sys- 
tem in STATIS. 

DENCAL keeps CIN constant in making the proportion change, making only the 
changes required by the change in parent cluster and Wp. 

If RATIO is the change In proportion, we have PROP' * RATI0*PR0P or 

CIN _ n A T 7 A CIN /C C X 

Tr F "- ' C T 0 T' ' MTI ° wp 'TTC T (55) 

which becomes 

CTOT' = W£ - (Wp - CT0T)/RATI0 (56) 

All the variables correspond to those in the program, except W(KF) * Wp , 

OLW = Wp. 
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3.6 APRIOR 


APRIOR is a short routine which returns the natural logarithm of the additional 
cutoff factor needed for one more class, times any volumetric factors needed 
to normalize the integrals over the continuous parameters. The cutoff factor 
is a function C(m) of m, the number of classes, such that 

2 C(m) = 1 (57) 

l=m 


It is necessary to multiply the overall probability by such a factor so it can 
be normalized even if the number of classes is unknown. Additional factors 
may be necessary to normalize various integrals over the continuous parameter 
space. 

The results given by CLASSY do net appear overly sensitive to the value given 
to APRIOR, and this should be generally true except in the case of very sta- 
tistically sensitive problems. As of this writing, no controlled study has 
been made of the effects of changing the values returned by APRIOR. It is 
possible that redefining the clustering problem handled by CLASSY without the 

use of the Bayesian model approach would clarify the range of values allowed 

2 

for APRIOR without getting the one cluster per point (or n clusters for n 
points) divergences in the clustering behavior of the program. 

At present, APRIOR returns a value of VFAC*MQ + BIAS, where VFAC is a control 
parameter giving the dimensionality dependent volumetric factors (as a log- 
arithm), and BIAS gives the logarithm of the overall cluster cutoff factor: 

C(m) = (constant) exp(-m*BIAS) (58) 

It is possible that VFAC, the dominant term, could be made as high as -log 2, 
or even -(log 2)/MQ, since when a cluster is split, the subclusters have 
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effectively twice their own volume for their means to move around in. It 
would certainly be useful to test CLASSY with some fixed and presumably arti- 
ficial data set, and find what values of VFAC and BIAS are required for the 
program to give anomalous behavior. Running close to these limits might make 
the algorithm more sensitive. It should be noted that the terms represented 
by APRIOR are quite small compared to those represented in the product prob- 
ability over all the points, simply because there are so many more points than 
clashes in most cases. Thus it is probable that the results are almost 
entirely independent of APRIOR, as they generally should be. 

3.7 ISPLIT 

1ST IT is a short logical function used during the mapping and output stages 
of che program. Although CLASSY naturally uses fractional assignment of each 
point to a number of classes, during the mapping stage a decision must be made 
as to which single class each pixel should be assigned. A decision must also 
be made as to whether to use a parent cluster or its subclusters; this latter 
decision is the function of ISPLIT. 

ISPLIT returns .TRUE, for a cluster if the cluster has subclusters, and either 
the likelihood ratio favors the subclusters, or the subclusters are older than 
the parent cluster. Otherwise ISPLIT returns .FALSE. The second proviso 
concerning the age of the cluster is necessary to avoid selecting newly JOINed 
parent clusters over their subclusters. Such parents have an advantage over 
their subclusters due to the APRIORi factor in SPFAC, and would be automatic- 
ally selected even if the subclusters were a much better fit to the data. 

Thus the second proviso in ISPLIT forces a decision in favor of the subclus- 
ters of a JOIN until the new parent cluster has succeeded in eliminating the 
subclusters. 

3.8 El GROT 

EIGROT is a general eigenvector-eigenvalue routine for symmetric matrices used 
primarily by SPLIT. It calls system routines, and is thus computing-system 
dependent. The routines used for EIGROT must handle the case of equal eigen- 
values correctly. 
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4. CONCLUSIONS 


4.1 TIMING AND OPTIMIZATION 

Although no detailed work has yet been done on the timing and optimization 
characteristics of CLASSY, a few of the main features are obvious. This anal- 
ysis will not examine the time consumed in scrambling the data, but only the 
timing of the CLASSY algorithm. The scrambling time should be measured, but 
any reasonable coding should make it much less than that consumed by CLASSY 
itself. The main timing at present should be the product of number of points 
times number of channels squared times number of clusters times number of 
iterations. The quadratic dependence on the number of channels should be 
noted in particular. CLASSY was originally designed as a research program, 
therefore many optimizing features were omitted. 

The first loop of STATIS is executed for every point times every cluster. The 
second loop is executed probably two or three times per data point. ADJUST 
is called once per WADJ interval, typically every 100 or 200 points, and the 
discrete transformation subroutines are called even less often, typically a 
few times per run. The dominant factors in both the first and second loops of 
STATIS are quadratic in the number of channels. ADJUST contains some opera- 
tions which are cubic in the number of channels (matrix inversion, etc.) but 
these are executed infrequently and therefore should consume little machine 
time. With the possible exception of SPLIT, the discrete transformations 
consume negligible machine time. 

It can be seen then, that perhaps 80 percent of more of the execution time 
for CLASSY is spent in the first loop, and of that time perhaps one-half (more 
for 16 channels) is spent in the routine DOTSQ. Rewriting that routine as a 
straight line (no indexing) assembly routine could speed up the algorithm by 
30 percent or more. Expanding all the vector routines and writing a low- 
precision assembly language routine for the function XP should double the 
speed of the system. 
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It is apparent that CLASSY is expending a great deal of effort in calculating 
the probabilities of points in classes in which they have very low probabil- 
ities. If these clusters could be eliminated from consideration without 
introducing a bias, the performance of CLASSY should be improved by a large 
factor. The second-loop processing of these clusters is eliminated by the 
Monte Carlo system. By moving parts of that system into the first loop, or 
making similar changes, timing could be considerably improved. 

One simple method of eliminating the quadratic form evaluation for every clus- 
ter is to maintain a lower bound of the form relative to some standard quad- 
ratic form which is evaluated only once. Or, better, the system could be 
rotated to the frame where this form is unity. The only alteration that this 
would require would be to the Shepherd's correction terms in ADJUST; the non- 
invariant JOIN selection would be improved. Then, since the distance to a 
point would be always greater than (x - y) relative to the standard form, 
most of the expensive quadratic form evaluation could be eliminated. This 
timing could be improved even more if techniques dividing the feature space 
were used. Additionally, the bounds for a cluster could take into account 
those of its subclusters, so that they could be eliminated with the main clus- 
ter. In any event, the extra overhead for maintaining the bound value should 
be small, and probably could be included in ADJUST rather than the updating 
portion of STATIS. Together with the optimized coding of the inner routines, 
this quick classification scheme might accelerate CLASS* by as much as a 
factor of 10. The other principal way to speed up CLASSY is to reduce the 
number of iterations necessary to reach a satisfactory solution. 


CLASSY currently takes a large, single cluster of low weight for its initial 
condition. If a run sufficiently similar to the given run is available, a 
spread out version of that run's final clusters could be used. However, in 
the absence of a good signature extension this might not lead to significant 
Improvement. A more general procedure might be to make several iterations 
using a subset or linear combination of the channels, and then switching to 
the full set for perhaps two passes. The two modes would be bridged by a 
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pass or a fraction of a pass using the old set of features to calculate the 
probabilities, and the new set in updating the class statistics, basically a 
variant of iterate mode. This technique should be short and easy to imple- 
ment and should nearly double CLASSY's speed on the 16-channel problem. (Note 
that WADJ must grow quite rapidly to go around the whole data set by the 
second pass.) 

The three major techniques just described would be quite simple to implement, 

and should speed CLASSY up by a factor of 15 to 20. Minor techniques for 

optimization might improve it by 40 percent more. 

Some minor techniques for optimization include: 

1. Making WADJ a function of cluster stability or rate of change 

2. Variation of PLIM and MONTE in the Monte Carlo system 

3. Finding good values for the acceleration parameters in ADJUST 

4. Improving the proportion system 

5. Newton's method for the convergence of the basic likelihood equations or 
Newton's method corrections might be applied for overlapping clusters 
(This is probably not warranted in the present version.) 

6. Better methods for guessing the component distributions after a SPLIT 

7. The selection of points highly dependent on the parameters to be placed 
on a temporary file and reprocessed intensively 

4.2 MODIFICATIONS AND IMPROVEMENTS 

Qualitative changes which might be made in the underlying model of CLASSY are 
described in this section. CLASSY imposes on the data an underlying structure 
of a mixture of normal distributions, probably not completely valid for Land- 
sat data. In some cases CLASSY has given overlapping clusters as output, 
which, if they correspond to the same class, are a result of skewed or kur- 
totic distributions. Two simple modifications, allowing a skewness parameter 
and a radial function or "form factor" for each class as additional parameters 
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would eliminate these cases. Only the traceless part of the kurtosis would 
then be available to detect and organize the splitting of clusters. Coming 
closer to the real class of distributions appearing in Landsat data should 
improve the resolution of CLASSY and subsequent classification. The skewness 
might be fixed without modifying CLASSY itself by imposing a nonlinear coordi- 
nate transformation on the brightness space. 

Another change in the model for CLASSY would allow "bridges" between clusters, 
to contain points which are themselves mixed. This technique might combine 
well with the Newton's method system, but is probably less cost-effective than 
simply doing a field recognition. 

Numerous other modifications could be made to CLASSY. More complex models 
requiring fewer parameters per cluster are among the most general. Here clus- 
ters would be assumed to have equal covariances, sparse covariances, zero 
skewness, etc., until the statistics prove otherwise. However, the question 
of which type of model to use is complex and system dependent, and will not 
be explored further in this report. 

Finally, the actual classification, or assignment of points to clusters, used 
in the final stage of CLASSY was added only for the purpose of obtaining read- 
able maps. This point assignment procedure could be improved, if training 
data were available to label the clusters. In this case, the proper procedure 
would be to add the probabilities for clusters having a common label before 
selecting the label with the maximum probability. 
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